OC-IA-P7 Neural Network training

This notebook aims to train a neural network for sentiment analysis locally, before deploying it on Azure.

We'll compare several text-normalization and embedding methods along the way.

Extract data and get a shuffled balanced sample of 10 000 tweets
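A balanced sample can be drawn by sampling an equal number of tweets per class and then shuffling. The sketch below uses a toy frame with hypothetical `text`/`sentiment` columns and Sentiment140-style labels (0/4); the real notebook would sample 5 000 tweets per class.

```python
import pandas as pd

# Toy stand-in for the raw tweet dataset (hypothetical column names).
df = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "sentiment": [0, 4] * 50,  # Sentiment140-style labels: 0 and 4
})

n_per_class = 10  # would be 5_000 for the real 10 000-tweet sample

# Draw an equal number of rows per class, then shuffle the result.
sample = (
    df.groupby("sentiment")
      .sample(n=n_per_class, random_state=42)
      .sample(frac=1, random_state=42)   # shuffle the concatenated sample
      .reset_index(drop=True)
)
```

`groupby(...).sample(...)` guarantees exact class balance, which a plain random sample of the full dataset would not.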

Split dataset
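A stratified split keeps the class balance in every subset. This is a sketch on toy data; the 70/15/15 proportions are an assumption, not necessarily the notebook's actual split.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the balanced sample (hypothetical data).
texts = [f"tweet {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# 70 % train, 15 % validation, 15 % test, stratified on the label.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
```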

Preprocess data

Re-label sentiment feature (target)

Since, for us, the positive case is the negative/unhappy sentiment, we map the "sentiment" column to the expected values:
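Assuming Sentiment140-style labels (0 = negative, 4 = positive), the re-labeling can be a simple mapping where the unhappy tweets become class 1:

```python
import pandas as pd

# Hypothetical labels: 0 = negative, 4 = positive (Sentiment140 convention).
df = pd.DataFrame({"sentiment": [0, 4, 0, 4]})

# Negative/unhappy tweets become our positive class (1).
df["target"] = df["sentiment"].map({0: 1, 4: 0})
print(df["target"].tolist())  # [1, 0, 1, 0]
```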

Clean text

Text must be cleaned before vectorization. We'll remove:

Then we'll apply stemming or lemmatization to improve model performance, and compare the two methods through the model's results. Here is an example of each preprocessing method:

Embedding

For our first try, we'll use a pre-trained English Word2vec model from Gensim.

To embed whole sentences, we'll average the vectors of each word.
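The averaging step can be sketched as follows. A tiny hand-built dictionary stands in for the pretrained Gensim `KeyedVectors` model (the vectors are made up); out-of-vocabulary tokens are simply skipped:

```python
import numpy as np

# Toy stand-in for pretrained word vectors (hypothetical 2-d values).
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}

def embed_sentence(tokens, vectors, dim=2):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(embed_sentence(["good", "movie", "oov"], toy_vectors))  # [0.5 0.5]
```

With a real Gensim model, `dim` would be the model's vector size (e.g. 300) and `vectors` its `KeyedVectors`.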

Our function is ready to preprocess each dataset:

First model training

Now that we have cleaned the data, we can create the model:
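A minimal Keras classifier over the averaged sentence vectors could look like this. The layer sizes, dropout rate, and 300-d input are assumptions for illustration, not the notebook's actual architecture:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical architecture: small dense classifier over 300-d sentence vectors.
model = keras.Sequential([
    layers.Input(shape=(300,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # probability of unhappy sentiment
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", keras.metrics.Recall(name="recall")],
)
```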

Click here to open TensorBoard.

EarlyStopping worked as expected, stopping training before the configured 50 epochs to avoid overfitting.
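The callback setup behind this might look like the sketch below; the patience value and log directory are assumptions:

```python
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard

# Stop when val_loss stops improving, keeping the best weights seen so far.
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)
tensorboard = TensorBoard(log_dir="logs/fit")  # hypothetical log directory

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=50, callbacks=[early_stop, tensorboard])
```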

The recall oscillates a lot; maybe another activation function would fit better?

We notice that the model converges much faster with the SELU activation function.
So far we have used lemmatization and Word2vec embeddings. Let's compare them with other normalization and embedding methods.

Find the best preprocessing methods

Lemmatization combined with GloVe or fastText vectors seems to be the best combination; we'll use it for the next steps.

Tuning hyperparameters

Since the recall is rather unstable, we won't monitor it for hyperparameter tuning; we'll monitor val_loss instead.
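A hypermodel for tuning could be parameterized as below. The tunable knobs (`units`, `learning_rate`) and their ranges are assumptions; with Keras Tuner, a variant of this function taking an `hp` argument would be searched with `objective="val_loss"`:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(units=64, learning_rate=1e-3):
    """Hypermodel sketch: a tuner would sample `units` and `learning_rate`."""
    model = keras.Sequential([
        layers.Input(shape=(300,)),
        layers.Dense(units, activation="selu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# With Keras Tuner (hypothetical usage):
# tuner = keras_tuner.RandomSearch(build_model_hp, objective="val_loss",
#                                  max_trials=10)
```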

Now that the tuner has found good parameters, we can use them in our model:

The model gives different results on each session:

But overall, we see that the accuracy does not go beyond roughly 75%.